In this study I interrogate data from the Olympic Games to extract some interesting insights. The data come from kaggle/piterfm and from the 126 Years of Historical Olympic Dataset. Although the data cover the period from 1896 onward, this study focuses on the Games held after the dissolution of the USSR; otherwise the results would be split between the USSR team up to the dissolution and all the newly independent states after it. To avoid that, and to reduce the amount of data to handle, I consider only the Olympic Games from the 1992 edition onward. Unfortunately, the data have some limitations:
- There are no results for qualification rounds. For instance, the men’s 100 m event contains only final results, without semi-finals and other heats.
- There is no information about individual athletes for team competitions with more than 2 participants; only team records are available.
Colors
Graphic functions
These datasets present a detailed country-wise record of Olympic medals from the first modern Olympics in 1896 to the most recent games in 2024. They provide insights into how different nations have performed over time, including their gold, silver, and bronze medal counts, overall rankings, and total medal tally.
This last dataset contains data from Olympedia and provides some additional information about the athletes who competed in the Olympic Games.
## spc_tbl_ [75,904 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ athlete_url : chr [1:75904] "https://olympics.com/en/athletes/cooper-woods-topalovic" "https://olympics.com/en/athletes/elofsson" "https://olympics.com/en/athletes/dylan-walczyk" "https://olympics.com/en/athletes/olli-penttala" ...
## $ athlete_full_name : chr [1:75904] "Cooper WOODS-TOPALOVIC" "Felix ELOFSSON" "Dylan WALCZYK" "Olli PENTTALA" ...
## $ games_participations: num [1:75904] 1 2 1 1 1 3 2 2 1 2 ...
## $ first_game : chr [1:75904] "Beijing 2022" "PyeongChang 2018" "Beijing 2022" "Beijing 2022" ...
## $ athlete_year_birth : num [1:75904] 2000 1995 1993 1995 1989 ...
## $ athlete_medals : chr [1:75904] NA NA NA NA ...
## $ bio : chr [1:75904] NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. athlete_url = col_character(),
## .. athlete_full_name = col_character(),
## .. games_participations = col_double(),
## .. first_game = col_character(),
## .. athlete_year_birth = col_double(),
## .. athlete_medals = col_character(),
## .. bio = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
I now perform the following operations on the dataset:
## spc_tbl_ [75,904 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ athlete_full_name: chr [1:75904] "Dylan WALCZYK" "Dmitriy REIKHERD" "Felix ELOFSSON" "Olli PENTTALA" ...
## $ athlete_url : chr [1:75904] "https://olympics.com/en/athletes/dylan-walczyk" "https://olympics.com/en/athletes/reikherd" "https://olympics.com/en/athletes/elofsson" "https://olympics.com/en/athletes/olli-penttala" ...
## $ sex : chr [1:75904] "Male" NA "Male" "Male" ...
## $ height : chr [1:75904] NA NA "184 cm" NA ...
## $ weight : chr [1:75904] NA NA "84 kg" NA ...
## $ NOC : chr [1:75904] "United States" NA "Sweden" "Finland" ...
## $ NOC_code : chr [1:75904] "USA" NA "SWE" "FIN" ...
## - attr(*, "spec")=
## .. cols(
## .. athlete_full_name = col_character(),
## .. athlete_url = col_character(),
## .. sex = col_character(),
## .. height = col_character(),
## .. weight = col_character(),
## .. NOC = col_character(),
## .. NOC_code = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
I’ll perform some necessary format corrections on the dataset to simplify later operations.
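As an illustration of the kind of format correction meant here, the sketch below converts strings such as "184 cm" and "84 kg" (as seen in the height and weight columns above) into numbers. The helper name `parse_measure` and the toy data frame are hypothetical; the source does not show the code it actually used.

```r
# Hypothetical helper: strip the unit suffix and convert to numeric.
parse_measure <- function(x) {
  as.numeric(sub("\\s*(cm|kg)$", "", x))
}

# Toy data mimicking the height/weight columns of athlete.additional.
toy <- data.frame(
  height = c(NA, "184 cm", "170 cm"),
  weight = c(NA, "84 kg", "62 kg"),
  stringsAsFactors = FALSE
)

toy$height <- parse_measure(toy$height)
toy$weight <- parse_measure(toy$weight)
str(toy)
```

Missing values stay `NA`, so the later missing-value analysis is unaffected by the conversion.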
## spc_tbl_ [21,697 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ discipline_title : chr [1:21697] "Curling" "Curling" "Curling" "Curling" ...
## $ slug_game : chr [1:21697] "beijing-2022" "beijing-2022" "beijing-2022" "beijing-2022" ...
## $ event_title : chr [1:21697] "Mixed Doubles" "Mixed Doubles" "Mixed Doubles" "Mixed Doubles" ...
## $ event_gender : chr [1:21697] "Mixed" "Mixed" "Mixed" "Mixed" ...
## $ medal_type : chr [1:21697] "GOLD" "GOLD" "SILVER" "SILVER" ...
## $ participant_type : chr [1:21697] "GameTeam" "GameTeam" "GameTeam" "GameTeam" ...
## $ participant_title : chr [1:21697] "Italy" "Italy" "Norway" "Norway" ...
## $ athlete_url : chr [1:21697] "https://olympics.com/en/athletes/stefania-constantini" "https://olympics.com/en/athletes/amos-mosaner" "https://olympics.com/en/athletes/kristin-skaslien" "https://olympics.com/en/athletes/magnus-nedregotten" ...
## $ athlete_full_name : chr [1:21697] "Stefania CONSTANTINI" "Amos MOSANER" "Kristin SKASLIEN" "Magnus NEDREGOTTEN" ...
## $ country_name : chr [1:21697] "Italy" "Italy" "Norway" "Norway" ...
## $ country_code : chr [1:21697] "IT" "IT" "NO" "NO" ...
## $ country_3_letter_code: chr [1:21697] "ITA" "ITA" "NOR" "NOR" ...
## - attr(*, "spec")=
## .. cols(
## .. discipline_title = col_character(),
## .. slug_game = col_character(),
## .. event_title = col_character(),
## .. event_gender = col_character(),
## .. medal_type = col_character(),
## .. participant_type = col_character(),
## .. participant_title = col_character(),
## .. athlete_url = col_character(),
## .. athlete_full_name = col_character(),
## .. country_name = col_character(),
## .. country_code = col_character(),
## .. country_3_letter_code = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
I’ll perform some necessary format corrections on the dataset to simplify later operations.
## spc_tbl_ [162,804 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ discipline_title : chr [1:162804] "Curling" "Curling" "Curling" "Curling" ...
## $ event_title : chr [1:162804] "Mixed Doubles" "Mixed Doubles" "Mixed Doubles" "Mixed Doubles" ...
## $ slug_game : chr [1:162804] "beijing-2022" "beijing-2022" "beijing-2022" "beijing-2022" ...
## $ participant_type : chr [1:162804] "GameTeam" "GameTeam" "GameTeam" "GameTeam" ...
## $ medal_type : chr [1:162804] "GOLD" "SILVER" "BRONZE" NA ...
## $ athletes : chr [1:162804] "[('Stefania CONSTANTINI', 'https://olympics.com/en/athletes/stefania-constantini'), ('Amos MOSANER', 'https://o"| __truncated__ "[('Kristin SKASLIEN', 'https://olympics.com/en/athletes/kristin-skaslien'), ('Magnus NEDREGOTTEN', 'https://oly"| __truncated__ "[('Almida DE VAL', 'https://olympics.com/en/athletes/almida-de-val'), ('Oskar ERIKSSON', 'https://olympics.com/"| __truncated__ "[('Jennifer DODDS', 'https://olympics.com/en/athletes/jennifer-dodds'), ('Bruce MOUAT', 'https://olympics.com/e"| __truncated__ ...
## $ rank_equal : logi [1:162804] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ rank_position : chr [1:162804] "1" "2" "3" "4" ...
## $ country_name : chr [1:162804] "Italy" "Norway" "Sweden" "Great Britain" ...
## $ country_code : chr [1:162804] "IT" "NO" "SE" "GB" ...
## $ country_3_letter_code: chr [1:162804] "ITA" "NOR" "SWE" "GBR" ...
## $ athlete_url : chr [1:162804] NA NA NA NA ...
## $ athlete_full_name : chr [1:162804] NA NA NA NA ...
## $ value_unit : chr [1:162804] NA NA NA NA ...
## $ value_type : chr [1:162804] NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. discipline_title = col_character(),
## .. event_title = col_character(),
## .. slug_game = col_character(),
## .. participant_type = col_character(),
## .. medal_type = col_character(),
## .. athletes = col_character(),
## .. rank_equal = col_logical(),
## .. rank_position = col_character(),
## .. country_name = col_character(),
## .. country_code = col_character(),
## .. country_3_letter_code = col_character(),
## .. athlete_url = col_character(),
## .. athlete_full_name = col_character(),
## .. value_unit = col_character(),
## .. value_type = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
I’ll perform some necessary format corrections on the dataset to simplify later operations.
## spc_tbl_ [53 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ game_slug : chr [1:53] "beijing-2022" "tokyo-2020" "pyeongchang-2018" "rio-2016" ...
## $ game_end_date : POSIXct[1:53], format: "2022-02-20 12:00:00" "2021-08-08 14:00:00" ...
## $ game_start_date: POSIXct[1:53], format: "2022-02-04 15:00:00" "2021-07-23 11:00:00" ...
## $ game_location : chr [1:53] "China" "Japan" "Republic of Korea" "Brazil" ...
## $ game_name : chr [1:53] "Beijing 2022" "Tokyo 2020" "PyeongChang 2018" "Rio 2016" ...
## $ game_season : chr [1:53] "Winter" "Summer" "Winter" "Summer" ...
## $ game_year : num [1:53] 2022 2020 2018 2016 2014 ...
## - attr(*, "spec")=
## .. cols(
## .. game_slug = col_character(),
## .. game_end_date = col_datetime(format = ""),
## .. game_start_date = col_datetime(format = ""),
## .. game_location = col_character(),
## .. game_name = col_character(),
## .. game_season = col_character(),
## .. game_year = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
I’ll perform some necessary format corrections on the dataset to simplify later operations.
## athlete_id games_participations first_game athlete_year_birth
## Length:41886 Min. : 1.000 Length:41886 Min. :1891
## Class :character 1st Qu.: 1.000 Class :character 1st Qu.:1972
## Mode :character Median : 1.000 Mode :character Median :1981
## Mean : 1.665 Mean :1981
## 3rd Qu.: 2.000 3rd Qu.:1990
## Max. :10.000 Max. :2009
## NA's :247
## gold silver bronze debut_age
## Min. : 0.0000 Min. :0.0000 Min. :0.0000 Min. :-61.00
## 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 21.00
## Median : 0.0000 Median :0.0000 Median :0.0000 Median : 24.00
## Mean : 0.1028 Mean :0.1001 Mean :0.1058 Mean : 24.45
## 3rd Qu.: 0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.: 27.00
## Max. :23.0000 Max. :6.0000 Max. :6.0000 Max. :122.00
## NA's :247
## age medalist athlete_full_name sex
## Min. : 16.00 Min. :0.000 Length:41886 Female:15087
## 1st Qu.: 35.00 1st Qu.:0.000 Class :character Male :24146
## Median : 44.00 Median :0.000 Mode :character NA's : 2653
## Mean : 44.14 Mean :0.186
## 3rd Qu.: 53.00 3rd Qu.:0.000
## Max. :134.00 Max. :1.000
## NA's :247
## height weight NOC_code
## Min. :133.0 Min. : 28.00 USA : 2446
## 1st Qu.:168.0 1st Qu.: 60.00 CAN : 1606
## Median :175.0 Median : 69.00 GER : 1568
## Mean :175.1 Mean : 70.33 FRA : 1499
## 3rd Qu.:182.0 3rd Qu.: 79.00 JPN : 1423
## Max. :217.0 Max. :214.00 (Other):30690
## NA's :9502 NA's :9502 NA's : 2654
## athlete_id games_participations first_game
## 0.000000000 0.000000000 0.000000000
## athlete_year_birth gold silver
## 0.005896958 0.000000000 0.000000000
## bronze debut_age age
## 0.000000000 0.005896958 0.005896958
## medalist athlete_full_name sex
## 0.000000000 0.000000000 0.063338586
## height weight NOC_code
## 0.226853841 0.226853841 0.063362460
From the analysis of the missing data and the distribution of the
variables, some problems stand out among the results:
- Some athletes debut before they were born.
- Some athletes debut after 75 years of age.
- Some athletes attended more than 5 editions, even reaching 10
  (a career spanning close to 40 years).
The first two problems probably come from an error in the data
collection process, given that the debut age is calculated by
subtracting the birth year of an athlete from the debut year.
Looking at the boxplot of athlete_year_birth, it is possible to see
an evident cluster of records that falls outside any plausible rare
case (born before 1920). The problem could lie either in the
collection of athlete_year_birth or in the collection of first_game.
To verify this last possibility, I cross-check the athletes born
before 1950 in the athlete.full dataset against the result.full
dataset.
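The debut-age computation described above can be sketched as follows. It assumes the debut year is the trailing four-digit year of first_game (for example "Beijing 2022"); that parsing rule and the toy data are assumptions, not the source's code.

```r
# Toy data: one plausible record and two impossible ones.
toy <- data.frame(
  first_game         = c("Beijing 2022", "Barcelona 1992", "Sydney 2000"),
  athlete_year_birth = c(2000, 1891, 2005),
  stringsAsFactors   = FALSE
)

# Debut year = trailing four-digit year in the game name.
debut_year <- as.numeric(sub(".*([0-9]{4})$", "\\1", toy$first_game))
toy$debut_age <- debut_year - toy$athlete_year_birth

# Flag records that cannot be right: debut before birth or after 75.
toy$suspicious <- toy$debut_age < 0 | toy$debut_age > 75
toy
```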
Check for the oldest athletes
Check for the “super premature” athletes
## [1] 0.9125
The analysis of the birth years and of the first game’s year revealed that:
- For the oldest athletes, those that have a match in the results dataset, the 1992 edition was probably one of their last, if not the last, and there is no evidence of errors.
- For the youngest athletes, among those with a recorded debut age below 12, 91% have no match in the results of the first Olympics recorded for them.
The decision is to delete the athletes born before 1950 that have no match in the results, and the athletes with debut_age <= 12.
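A minimal sketch of the cross-check and of the resulting filter, on toy data. The key columns (`athlete_id`, `first_slug`, `slug_game`) and the matching rule are assumptions made for illustration.

```r
athletes <- data.frame(
  athlete_id         = c("a", "b", "c", "d"),
  athlete_year_birth = c(1940, 1945, 1990, 1999),
  debut_age          = c(52, 47, 10, 23),
  first_slug         = c("barcelona-1992", "barcelona-1992",
                         "sydney-2000", "beijing-2022"),
  stringsAsFactors   = FALSE
)
results <- data.frame(
  athlete_id       = c("a", "d"),
  slug_game        = c("barcelona-1992", "beijing-2022"),
  stringsAsFactors = FALSE
)

# An athlete "matches" when a result exists for them at their first game.
key_a <- paste(athletes$athlete_id, athletes$first_slug)
key_r <- paste(results$athlete_id, results$slug_game)
athletes$matched <- key_a %in% key_r

# Share of very young debutants without a match in the results.
young <- athletes$debut_age < 12
share_unmatched <- mean(!athletes$matched[young])

# Drop unmatched athletes born before 1950 and anyone with debut_age <= 12.
drop <- (athletes$athlete_year_birth < 1950 & !athletes$matched) |
  athletes$debut_age <= 12
clean <- athletes[!drop, ]
```

On real data, rows with a missing debut_age would need explicit handling before the filter, since `NA` propagates through the logical condition.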
Regarding games_participations, I have checked and, although very rare, there have been some athletes with very long careers, so nothing seems out of the ordinary.
Regarding missing values, those coming from the athlete dataset are
very few (less than 1%), while those coming from athlete.additional
were probably not available at the source. I save the indexes of
those missing values to keep track of them and deal with them later,
according to the needs of the analysis.
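The per-column proportions of missing values shown above, and the saved indexes, can be obtained along these lines (a sketch on toy data; the object names are illustrative).

```r
toy <- data.frame(
  sex    = c("Male", NA, "Female", "Male"),
  height = c(184, NA, NA, 170),
  stringsAsFactors = FALSE
)

# Proportion of missing values per column.
na_share <- colMeans(is.na(toy))

# Row indexes of the missing values, stored per column for later use.
na_idx <- lapply(toy, function(col) which(is.na(col)))

na_share
na_idx
```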
## discipline_title slug_game event_title event_gender
## 0.0000000 0.0000000 0.0000000 0.0000000
## medal_type participant_type NOC_code athlete_id
## 0.0000000 0.0000000 0.0000000 0.1951546
## athlete_full_name participant_title game_end_date game_start_date
## 0.1648454 0.7445361 0.0000000 0.0000000
## game_location game_name game_season game_year
## 0.0000000 0.0000000 0.0000000 0.0000000
From the analysis of the missing values, it stands out that the vast
majority of them are in the participant_title variable, but that is
not a problem given that most of the values of participant_title are
just the full name corresponding to NOC_code. It also doesn’t seem to
be a relevant variable for any of the aims of the analysis. The medal
records that have no athlete associated with them (the 1599 records
missing both athlete_full_name and athlete_id) relate to medals with
“GameTeam” participant_type and more than 2 members, so I’ll set
athlete_full_name and athlete_id to
“discipline_title-event_title-Team 2+”. Finally, for the records
that have an NA only in athlete_id, the idea is to check the
athlete.full dataset and see whether there is a unique correspondence
between the name and the id.
## [1] 212
It seems that, among the 212 athlete names without an ID, none is
present in the athlete.full dataset. For the medals without an
associated athlete I save the indexes, just to keep track of which
ones they are.
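The name-to-id lookup can be sketched as below. Only names that map to exactly one id are used, so an ambiguous name never receives an arbitrary id. The toy names and ids are made up for illustration.

```r
athlete_full <- data.frame(
  athlete_id        = c("id1", "id2", "id3"),
  athlete_full_name = c("Anna ROSSI", "John SMITH", "John SMITH"),
  stringsAsFactors  = FALSE
)
medals <- data.frame(
  athlete_full_name = c("Anna ROSSI", "John SMITH", "Mario BIANCHI"),
  athlete_id        = NA_character_,
  stringsAsFactors  = FALSE
)

# Keep only names that map to exactly one id.
name_counts  <- table(athlete_full$athlete_full_name)
unique_names <- names(name_counts)[name_counts == 1]
lookup <- athlete_full[athlete_full$athlete_full_name %in% unique_names, ]

# Fill missing ids where a unique match exists; the others stay NA.
hit  <- match(medals$athlete_full_name, lookup$athlete_full_name)
fill <- is.na(medals$athlete_id) & !is.na(hit)
medals$athlete_id[fill] <- lookup$athlete_id[hit[fill]]

# Number of names still without an id after the lookup.
sum(is.na(medals$athlete_id))
```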
## discipline_title event_title slug_game participant_type
## 0.000000000 0.000000000 0.000000000 0.000000000
## medal_type rank_equal rank_position value_unit
## 0.887398841 0.759610807 0.008963507 0.606917691
## value_type NOC_code athlete_full_name athlete_id
## 0.547366099 0.000000000 0.073867661 0.105425709
## game_end_date game_start_date game_location game_name
## 0.000000000 0.000000000 0.000000000 0.000000000
## game_season game_year
## 0.000000000 0.000000000
In this dataset the missing values must be treated according to the
type of variable. For example, in the variable medal_type a missing
value probably stands for None (no medal), also considering that
~90% of the values are missing. On the other hand, the co-occurrence
graph looks very messy. I’ll try some other visualizations of the
missing values to understand more.
With these additional analyses it seems that the missing values in
the variable rank_equal correspond to FALSE. For the variables
value_type and value_unit there doesn’t seem to be any significant
pattern for the moment, probably just a lack of information from the
source, but it is too soon to tell. Finally, for athlete_id I’ll try
to match against the athlete.full dataset, as done with medals.full.
Now, leaving aside value_unit and value_type, the missing values in
the dataset are:
- medal_type: NA means that no medal was won;
- athlete_full_name and athlete_id at the same time: those records
  correspond to GameTeam results in which the group has more than 2
  participants. I’ll set athlete_full_name and athlete_id to
  “discipline_title-event_title-Team 2+”;
- athlete_id alone: NA means that the athlete doesn’t have any id
  associated in the athlete.full dataset.
As before, I save the indexes of the data that are actually missing.
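The three rules above can be sketched on toy data as follows. The literal label "None" for a missing medal_type is an assumption; the placeholder string follows the pattern given in the text.

```r
results <- data.frame(
  discipline_title  = c("Curling", "Ice Hockey", "Biathlon"),
  event_title       = c("Mixed Doubles", "Men", "Sprint"),
  medal_type        = c("GOLD", NA, NA),
  athlete_full_name = c("Amos MOSANER", NA, "Some ATHLETE"),
  athlete_id        = c("amos-mosaner", NA, NA),
  stringsAsFactors  = FALSE
)

# NA in medal_type means no medal was won (label is an assumption).
results$medal_type[is.na(results$medal_type)] <- "None"

# Both name and id missing: team result, replaced by a placeholder label.
team  <- is.na(results$athlete_full_name) & is.na(results$athlete_id)
label <- paste(results$discipline_title, results$event_title,
               "Team 2+", sep = "-")
results$athlete_full_name[team] <- label[team]
results$athlete_id[team]        <- label[team]

# Id alone missing: genuinely unknown, keep the row indexes for later.
missing_id_idx <- which(is.na(results$athlete_id))
```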
## slug_game game_end_date game_start_date
## Length:17 Min. :1992-02-23 19:00:00 Min. :1992-02-08 07:00:00
## Class :character 1st Qu.:1998-02-22 11:00:00 1st Qu.:1998-02-06 23:00:00
## Mode :character Median :2006-02-26 19:00:00 Median :2006-02-10 07:00:00
## Mean :2006-07-23 06:07:03 Mean :2006-07-07 03:03:31
## 3rd Qu.:2014-02-23 16:00:00 3rd Qu.:2014-02-07 04:00:00
## Max. :2022-02-20 12:00:00 Max. :2022-02-04 15:00:00
## game_location game_name game_season game_year
## Length:17 Length:17 Summer:8 Min. :1992
## Class :character Class :character Winter:9 1st Qu.:1998
## Mode :character Mode :character Median :2006
## Mean :2006
## 3rd Qu.:2014
## Max. :2022
## tibble [17 × 7] (S3: tbl_df/tbl/data.frame)
## $ slug_game : chr [1:17] "beijing-2022" "tokyo-2020" "pyeongchang-2018" "rio-2016" ...
## $ game_end_date : POSIXct[1:17], format: "2022-02-20 12:00:00" "2021-08-08 14:00:00" ...
## $ game_start_date: POSIXct[1:17], format: "2022-02-04 15:00:00" "2021-07-23 11:00:00" ...
## $ game_location : chr [1:17] "China" "Japan" "Republic of Korea" "Brazil" ...
## $ game_name : chr [1:17] "Beijing 2022" "Tokyo 2020" "PyeongChang 2018" "Rio 2016" ...
## $ game_season : Factor w/ 2 levels "Summer","Winter": 2 1 2 1 2 1 2 1 2 1 ...
## $ game_year : num [1:17] 2022 2020 2018 2016 2014 ...
Now I’ll visualize the data and try to extract some relevant insights.
## [1] "China TRUE"
## [1] "Japan TRUE"
## [1] "Republic of Korea FALSE"
## [1] "Brazil TRUE"
## [1] "Russian Federation FALSE"
## [1] "Great Britain FALSE"
## [1] "Canada TRUE"
## [1] "Italy TRUE"
## [1] "Greece TRUE"
## [1] "United States FALSE"
## [1] "Australia TRUE"
## [1] "Norway TRUE"
## [1] "Spain TRUE"
## [1] "France TRUE"
From 1992 onward the Olympic Games have been hosted across all
continents except Africa, with the USA, Japan and China each hosting
twice.
As the image shows, in the summer editions from 1992 onward there has been an undisputed dominance by, in order, the USA, China, Russia and Germany, followed by the UK, Australia, France, Japan and Italy. In the winter editions, instead, the podium changes a bit, with Germany as the absolute leader, followed by Norway and the USA. Brilliant performances have also been achieved by Canada, Austria, Russia and Italy.
Looking at how the nations’ performance evolved over time, in the summer editions there is a remarkable uptrend for the UK, Japan and, to a lesser extent, China, while Germany shows a very significant downtrend. These patterns are not accidental: the rise of Great Britain is strongly linked to the consistent investment plan launched after Sydney 2000 and reinforced in the run-up to London 2012, while Japan’s progression is connected to the preparation for Tokyo 2020 and a long-term investment in sports science. China’s boost reflects the national strategy of using sport as a symbol of prestige, with a clear turning point around Beijing 2008. By contrast, Germany’s decline after the early ’90s is strongly related to the end of the centralized and highly efficient sports system of East Germany, which guaranteed very strong results before reunification. Throughout the period, the dominance of the USA remains undisputed, based on an enormous pool of athletes, strong university sports programs and the ability to remain competitive in a wide variety of disciplines. In the last 20 years, China has also emerged as a stable counterpart in the top positions.
Looking at the winter games, instead, a remarkable uptrend is shown by the USA and Canada and, in the last 20 years, by Norway, up to its current domination. These dynamics can be explained by the rise of new winter disciplines (snowboard, freestyle) where the USA has a cultural and commercial advantage, and by the Canadian program Own the Podium launched for Vancouver 2010, which gave a permanent boost. Norway’s extraordinary growth is the result of a long-lasting tradition in Nordic sports, combined with an ethical and sustainable approach that avoided the collapse seen in other countries after doping scandals. Unlike in the summer editions, here the dominance of Germany isn’t as evident as that of the USA. In fact, in the short term the top position clearly shifted to Norway, but looking at the cumulative number of medals Germany still keeps the overall lead, a lead that nevertheless seems increasingly under threat and likely to be lost in the next editions.
Let’s inspect where the currently dominant nations get their strength (USA, China and UK for the summer editions, Norway and Germany for the winter ones).
Regarding the summer editions, the USA has won on average ~11.5
medals per discipline, thanks to its absolute domination in athletics
and swimming and, to a lesser extent, to artistic gymnastics and
wrestling. This concentration is not accidental: athletics and
swimming provide many medal opportunities and the USA, with its huge
pool of athletes, strong college sport system and world-class
infrastructure, has been able to consistently turn quantity into
quality. China, with a lower average of ~9.55 medals per discipline,
has been competitive, but not as disruptive as the USA, in athletics
and swimming, and has instead built its strength on a very wide
variety of sports such as badminton, diving, shooting, table tennis,
artistic gymnastics and weightlifting. This reflects a precise
national strategy: investing in sports with many medal events or less
global competition, supported by centralized talent identification
and early specialization. The UK, with an even lower average of ~6
medals per discipline, has been competitive in athletics, track
cycling, rowing, sailing and swimming. Its success is the result of
deliberate investments after Sydney 2000 and especially in
preparation for London 2012, focusing on disciplines with a strong
national tradition and where technology and innovation could make
the difference.
Looking at the winter editions, a very similar situation stands out: Norway, with an average of ~13.28 medals per discipline, owes its growing dominance mainly to cross country skiing, biathlon and alpine skiing, with some decent results also in nordic combined, ski jumping and speed skating. The reason for such supremacy lies in the cultural roots of skiing in Norway, practiced widely at every level, and in a federation system oriented towards long-term and sustainable athlete development. Germany, with an average of ~10 medals per discipline, has been less dominant than Norway, but has been competitive across multiple disciplines such as alpine skiing, biathlon, bobsleigh, cross country skiing, luge, nordic combined, ski jumping and speed skating. The German model is diversified rather than concentrated, reflecting a strong tradition in sliding sports and a well-developed infrastructure, which has guaranteed a broad but less absolute competitiveness.
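The source does not say exactly how "average medals per discipline" is computed. One plausible reading, sketched here on toy data, counts medals per country and discipline and then averages over the disciplines in which the country won at least one medal.

```r
medals <- data.frame(
  NOC_code         = c("NOR", "NOR", "NOR", "NOR", "GER", "GER", "GER"),
  discipline_title = c("Biathlon", "Biathlon", "Biathlon", "Alpine Skiing",
                       "Luge", "Biathlon", "Luge"),
  stringsAsFactors = FALSE
)

# Medal counts per country (rows) and discipline (columns).
tab <- table(medals$NOC_code, medals$discipline_title)

# Average over the disciplines with at least one medal (assumed definition).
avg <- apply(tab, 1, function(counts) mean(counts[counts > 0]))
avg
```

Averaging over all disciplines, including those with zero medals, would give lower values, so the definition matters when comparing countries.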
The individual performances of the athletes reflect what has been
observed above. For Norway, Marit Bjørgen (cross country skiing), Ole
Einar Bjørndalen (biathlon), Bjørn Dæhlie (cross country skiing) and
Kjetil André Aamodt (alpine skiing) performed far beyond the
ordinary, statistically counting as outliers, and their records
symbolize the cultural centrality of winter sports in the country.
Bjørgen is the most decorated Winter Olympian of all time, while
Bjørndalen, known as the “King of Biathlon”, ranks second. Aamodt is
still the most successful alpine skier in Olympic history. The USA
shows a similar pattern of outstanding outliers, mainly in swimming
and athletics: Michael Phelps is the most decorated Olympian ever,
while Katie Ledecky has become the most dominant female swimmer in
history. Allyson Felix is the most decorated American track and field
athlete, and together with names like Ryan Lochte, Shannon Miller,
Aaron Peirsol, Amanda Beard, Natalie Coughlin and Gary Hall Jr., they
embody the American dominance across multiple cycles of the Games. On
the other hand, for Germany and China the distribution doesn’t show
such extreme outliers, even if remarkable athletes have left a
significant mark: for Germany, Uschi Disl (biathlon) and Katja
Seizinger (alpine skiing) are the most notable names, while many
medals also came from team events in sliding sports. For China,
instead, the dominance has been more distributed, reflecting the
national strategy; nevertheless, some athletes stand out for their
consistency across multiple Games. In diving, Wu Minxia, Guo
Jingjing, Qin Kai, Fu Mingxia, Chen Ruolin and Cao Yuan embody the
tradition of excellence that made China the global powerhouse in this
discipline. In addition, Wang Yifu in shooting and Sun Yang in
swimming have been central figures of the Chinese success, showing
how the country’s performance, though less concentrated on single
“outliers” than that of the USA or Norway, has been built on a solid
core of multi-medalists across different disciplines.
In this paper I explore, given the past performance of the athletes who took part in the Olympic Games, what it takes to be an Olympic medalist in the winter editions (like the 2026 Milano-Cortina Games). Then I’ll try to estimate the chances that Italian athletes have of winning a medal.
I now create the full dataset that I’m going to use from here on,
starting from the athlete.full and results.full datasets. I
immediately remove the irrelevant variables and create some new
“indicators” that might be useful in the analysis:
- Medalist: whether the result corresponds to a medal or not;
- HomeGame: whether the athlete was competing in their own nation or
  not;
- participation_number: the cumulative number of participations for
  each athlete.
Furthermore, I’ll work only on the winter editions, in accordance with the aim of the study.
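The three indicators can be sketched on toy data as follows. The column `game_year` and the rule "HomeGame when NOC_code equals the host's code" are assumptions about how the source derives them.

```r
res <- data.frame(
  athlete_id       = c("a", "a", "b", "a"),
  game_year        = c(2014, 2018, 2018, 2022),
  NOC_code         = c("ITA", "ITA", "KOR", "ITA"),
  game_location    = c("RUS", "KOR", "KOR", "CHN"),
  medal_type       = c(NA, "GOLD", NA, "BRONZE"),
  stringsAsFactors = FALSE
)

# Medalist: 1 when the result carries a medal, 0 otherwise.
res$Medalist <- factor(as.integer(!is.na(res$medal_type)), levels = c(0, 1))

# HomeGame: 1 when the athlete's NOC matches the host country code.
res$HomeGame <- factor(as.integer(res$NOC_code == res$game_location),
                       levels = c(0, 1))

# participation_number: running count of distinct editions per athlete.
# match(y, unique(y)) gives the same number to several events in one edition.
res <- res[order(res$athlete_id, res$game_year), ]
res$participation_number <- ave(
  res$game_year, res$athlete_id,
  FUN = function(y) match(y, unique(y))
)
res
```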
##
## 0 1
## 0.93 0.07
##
## 0 1
## 0.93 0.07
##
## 0 1
## 0.92 0.08
After dividing the dataset into the necessary subsets, each with the same class proportions, I proceed with the analysis of the missing values.
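A stratified split that preserves the class proportions can be sketched in base R as below. The 70% fraction, the seed and the helper name `stratified_idx` are arbitrary choices for illustration; the source does not state the split it used.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
y <- factor(rep(c(0, 1), times = c(930, 70)))

# Sample the same fraction within each class so proportions are preserved.
stratified_idx <- function(y, frac) {
  unlist(lapply(split(seq_along(y), y), function(idx) {
    idx[sample.int(length(idx), size = round(frac * length(idx)))]
  }))
}

train_idx <- stratified_idx(y, 0.7)

# Class proportions in the two subsets.
round(prop.table(table(y[train_idx])), 2)
round(prop.table(table(y[-train_idx])), 2)
```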
## tibble [13,982 × 15] (S3: tbl_df/tbl/data.frame)
## $ discipline_title : Factor w/ 86 levels "3x3 Basketball",..: 68 13 49 36 68 2 21 70 70 21 ...
## $ event_title : chr [1:13982] "Giant parallel slalom men" "Women's 15km Individual" "Singles men" "Pairs mixed" ...
## $ participant_type : Factor w/ 2 levels "Athlete","GameTeam": 1 1 1 2 1 1 1 1 1 1 ...
## $ game_location : chr [1:13982] "ITA" "CHN" "USA" "CAN" ...
## $ game_season : Factor w/ 2 levels "Summer","Winter": 2 2 2 2 2 2 2 2 2 2 ...
## $ HomeGame : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Medalist : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ athlete_year_birth : num [1:13982] 1984 1997 1969 1986 1993 ...
## $ debut_age : num [1:13982] 22 25 23 20 21 23 21 19 23 20 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 1 2 1 2 1 ...
## $ height : int [1:13982] 185 NA 178 159 170 170 NA 164 178 168 ...
## $ weight : num [1:13982] 84 NA 88 44 62 73 NA 70 73 53 ...
## $ NOC_code : Factor w/ 230 levels "AFG","AHO","ALB",..: 82 184 133 174 192 38 113 109 217 109 ...
## $ result_age : num [1:13982] 22 25 33 24 21 23 25 23 27 20 ...
## $ participation_number: int [1:13982] 1 1 4 2 1 1 2 2 2 1 ...
## discipline_title event_title participant_type
## Cross Country Skiing:2901 Length:13982 Athlete :12749
## Alpine Skiing :2797 Class :character GameTeam: 1233
## Biathlon :1969 Mode :character
## Speed skating :1458
## Snowboard : 788
## Freestyle Skiing : 775
## (Other) :3294
## game_location game_season HomeGame Medalist athlete_year_birth
## Length:13982 Summer: 0 0:13220 0:12972 Min. :1945
## Class :character Winter:13982 1: 762 1: 1010 1st Qu.:1973
## Mode :character Median :1982
## Mean :1982
## 3rd Qu.:1990
## Max. :2006
##
## debut_age sex height weight NOC_code
## Min. :13.00 Female:5663 Min. :136.0 Min. : 30.00 USA : 948
## 1st Qu.:21.00 Male :7810 1st Qu.:168.0 1st Qu.: 60.00 CAN : 842
## Median :23.00 NA's : 509 Median :174.0 Median : 68.00 ITA : 741
## Mean :23.28 Mean :174.3 Mean : 69.39 GER : 718
## 3rd Qu.:25.00 3rd Qu.:181.0 3rd Qu.: 78.00 FRA : 663
## Max. :50.00 Max. :216.0 Max. :127.00 (Other):9560
## NA's :2086 NA's :2086 NA's : 510
## result_age participation_number
## Min. : 0.00 Min. :1.000
## 1st Qu.:23.00 1st Qu.:1.000
## Median :26.00 Median :1.000
## Mean :26.24 Mean :1.675
## 3rd Qu.:29.00 3rd Qu.:2.000
## Max. :51.00 Max. :8.000
##
## discipline_title event_title participant_type
## 0.00000000 0.00000000 0.00000000
## game_location game_season HomeGame
## 0.00000000 0.00000000 0.00000000
## Medalist athlete_year_birth debut_age
## 0.00000000 0.00000000 0.00000000
## sex height weight
## 0.03640395 0.14919182 0.14919182
## NOC_code result_age participation_number
## 0.03647547 0.00000000 0.00000000
Regarding the missing values, the NA’s in the age variables (debut_age, result_age and athlete_year_birth) are all related to a lack of information from the source, but given that they are less than 1% of the total, I simply won’t consider them. The presence of NA’s in NOC_code is not a problem, given that I probably won’t use that variable in the algorithm. Regarding sex, height and weight, they are also missing due to a lack of information from the source. Given that they account for only ~4% (sex) and ~15% (height and weight) of the total, I’ll impute height and weight with the sex-stratified median, and sex with the mode.
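A minimal sketch of this imputation on toy data. Imputing sex first, so that every row belongs to a group before the medians are taken, is my ordering choice and not something the source states.

```r
toy <- data.frame(
  sex    = c("Male", "Male", "Female", "Female", NA, "Male"),
  height = c(180, NA, 165, 169, 175, 190),
  stringsAsFactors = FALSE
)

# Impute sex with the most frequent class.
sex_mode <- names(which.max(table(toy$sex)))
toy$sex[is.na(toy$sex)] <- sex_mode

# Impute height with the median of the athlete's own sex group.
group_median <- ave(toy$height, toy$sex,
                    FUN = function(h) median(h, na.rm = TRUE))
missing_h <- is.na(toy$height)
toy$height[missing_h] <- group_median[missing_h]
toy
```

The same pattern applies to weight.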
## P value debut_age
## [1] 2.594079e-06
## P value height
## [1] 0.9034867
## P value weight
## [1] 0.1389711
## P value participation_number
## [1] 8.943577e-28
## P value result_age
## [1] 1.291988e-13
## P value participant_type
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tbl
## X-squared = 75.521, df = 1, p-value < 2.2e-16
##
## P value NOC_code
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = NaN, df = 229, p-value = NA
##
## P value HomeGame
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tbl
## X-squared = 7.84, df = 1, p-value = 0.00511
##
## P value sex
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tbl
## X-squared = 7.601, df = 1, p-value = 0.005834
Given the results, the variables that seem appropriate to consider are debut_age, height, weight, participation_number, participant_type, sex and HomeGame. From now on I’ll consider only these variables. In general there isn’t any particular multicollinearity, except between height and weight and between debut_age and result_age.
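The source does not name the test behind the p-values of the numeric variables. The sketch below assumes a Wilcoxon rank-sum test for numeric predictors and a chi-squared test for categorical ones, on simulated data. It also shows one likely reason for the `X-squared = NaN` reported for NOC_code: a factor with unused levels produces all-zero rows in the contingency table.

```r
set.seed(1)  # arbitrary seed for the simulated data
medalist  <- factor(rep(c(0, 1), times = c(180, 20)))
debut_age <- c(rnorm(180, mean = 24, sd = 3), rnorm(20, mean = 22, sd = 3))
home      <- factor(sample(c(0, 1), 200, replace = TRUE), levels = c(0, 1))

# Numeric predictor vs. binary outcome (test choice is an assumption).
p_num <- wilcox.test(debut_age ~ medalist)$p.value

# Categorical predictor vs. binary outcome.
p_cat <- chisq.test(table(home, medalist))$p.value

# A factor with an unused level gives an all-zero row, hence NaN.
noc  <- factor(sample(c("ITA", "NOR"), 200, replace = TRUE),
               levels = c("ITA", "NOR", "ZZZ"))
bad  <- suppressWarnings(chisq.test(table(noc, medalist)))
good <- suppressWarnings(chisq.test(table(droplevels(noc), medalist)))

c(bad = unname(bad$statistic), good = unname(good$statistic))
```

Dropping unused levels with `droplevels()` before building the table avoids the undefined statistic, although with hundreds of sparse levels the chi-squared approximation would still be unreliable.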
Let’s inspect the possible presence of outliers.
[Figures: distribution plots for debut_age, height, weight and participation_number]
From the distribution plots there doesn’t seem to be any particular outlier. Furthermore, after the log transformations, normality seems to be respected by all the variables, and the variance and covariance seem to be shared by the classes.
It’s immediately apparent that the variables are distributed across
significantly different scales. To compare them, it’s useful to
standardize them all onto the same scale.
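A sketch of the standardization on toy data. Computing the mean and standard deviation on the training set only and reusing them for the other subsets is a common practice that I assume here; the source does not say how it was done.

```r
train <- data.frame(debut_age = c(20, 24, 28), weight = c(60, 70, 80))
test  <- data.frame(debut_age = c(22, 30),     weight = c(65, 90))

# Centre and scale using statistics computed on the training set only.
mu    <- colMeans(train)
sigma <- apply(train, 2, sd)

train_std <- scale(train, center = mu, scale = sigma)
test_std  <- scale(test,  center = mu, scale = sigma)

train_std
test_std
```

Reusing the training statistics keeps information from the held-out data out of the preprocessing step.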
Given that the aim of the project is mainly descriptive, I’ll only evaluate the classification with logistic regression and with a decision tree.
##
## Call:
## glm(formula = Medalist ~ ., family = binomial, data = sub_train.w[,
## c(relevant_var.numeric.w, relevant_var.vategorical.w, "Medalist")])
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.5126429 0.0662234 -37.942 < 2e-16 ***
## debut_age -0.1095616 0.0353328 -3.101 0.00193 **
## height 0.0004107 0.0611965 0.007 0.99465
## weight 0.1528571 0.0570996 2.677 0.00743 **
## participation_number 0.3763478 0.0315043 11.946 < 2e-16 ***
## participant_typeGameTeam 0.8182707 0.0945714 8.652 < 2e-16 ***
## sexMale -0.4105080 0.0967102 -4.245 2.19e-05 ***
## HomeGame1 0.3613143 0.1283715 2.815 0.00488 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7253.4 on 13981 degrees of freedom
## Residual deviance: 6987.6 on 13974 degrees of freedom
## AIC: 7003.6
##
## Number of Fisher Scoring iterations: 5
##
## Call:
## glm(formula = Medalist ~ debut_age + weight + participation_number +
## participant_type + sex + HomeGame, family = binomial, data = sub_train.w[,
## c(relevant_var.numeric.w, relevant_var.vategorical.w, "Medalist")])
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.51278 0.06310 -39.825 < 2e-16 ***
## debut_age -0.10956 0.03533 -3.101 0.001928 **
## weight 0.15311 0.04259 3.595 0.000324 ***
## participation_number 0.37635 0.03150 11.948 < 2e-16 ***
## participant_typeGameTeam 0.81821 0.09415 8.690 < 2e-16 ***
## sexMale -0.41027 0.08997 -4.560 5.11e-06 ***
## HomeGame1 0.36129 0.12831 2.816 0.004867 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7253.4 on 13981 degrees of freedom
## Residual deviance: 6987.6 on 13975 degrees of freedom
## AIC: 7001.6
##
## Number of Fisher Scoring iterations: 5
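The two summaries above differ only by the height term, and the comparison can be checked by hand. For a logistic regression on individual 0/1 outcomes the residual deviance equals minus twice the log-likelihood, so AIC = residual deviance + 2k, where k is the number of estimated coefficients. The short sketch below redoes that arithmetic with the values printed above; the variable names are chosen here for illustration.

```r
# Values copied from the two model summaries printed above
residual_deviance <- 6987.6   # identical (to one decimal) in both models
k_full    <- 8                # intercept + 7 slopes, height included
k_reduced <- 7                # same model without height

# AIC = residual deviance + 2 * number of coefficients (binary 0/1 response)
aic_full    <- residual_deviance + 2 * k_full
aic_reduced <- residual_deviance + 2 * k_reduced

print(c(full = aic_full, reduced = aic_reduced))
```

Dropping height leaves the deviance essentially unchanged and removes one parameter, which is why the reduced model has the lower AIC.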
Let’s check whether there are any influential points that could bias the model.
Let’s remove those points and fit the model again.
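One common way to flag influential points in a logistic regression is Cook’s distance. The sketch below uses simulated data and a 4/n cutoff; both the cutoff and the way `influence_idx` is built are assumptions for illustration, since the rule actually used in this study is not shown.

```r
set.seed(42)
n <- 500

# Simulated rare binary outcome (NOT the Olympic data)
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2 + 0.5 * x))
dat <- data.frame(x = x, y = y)

fit <- glm(y ~ x, family = binomial, data = dat)

# Cook's distance for every observation
cooks <- cooks.distance(fit)

# Assumed rule of thumb: flag observations with distance above 4 / n
keep <- cooks <= 4 / n
influence_idx <- which(!keep)

# Subset with a logical mask: dat[-influence_idx, ] would return zero rows
# if no observation were flagged
dat_clean <- dat[keep, ]
refit <- glm(y ~ x, family = binomial, data = dat_clean)

print(length(influence_idx))
print(round(rbind(original = coef(fit), refit = coef(refit)), 3))
```

Comparing the two coefficient rows shows how much the flagged points move the estimates; in the summaries that follow, the coefficients change only slightly after removal.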
##
## Call:
## glm(formula = Medalist ~ ., family = binomial, data = sub_train.w[-influence_idx,
## c(relevant_var.numeric.w, relevant_var.vategorical.w, "Medalist")])
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.50363 0.06618 -37.832 < 2e-16 ***
## debut_age -0.11900 0.03547 -3.355 0.000793 ***
## height 0.01161 0.06131 0.189 0.849819
## weight 0.15559 0.05719 2.721 0.006515 **
## participation_number 0.37820 0.03159 11.972 < 2e-16 ***
## participant_typeGameTeam 0.82516 0.09486 8.699 < 2e-16 ***
## sexMale -0.43633 0.09694 -4.501 6.77e-06 ***
## HomeGame1 0.33126 0.13047 2.539 0.011116 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7227.0 on 13975 degrees of freedom
## Residual deviance: 6957.5 on 13968 degrees of freedom
## AIC: 6973.5
##
## Number of Fisher Scoring iterations: 5
##
## Call:
## glm(formula = Medalist ~ debut_age + weight + participation_number +
## participant_type + sex + HomeGame, family = binomial, data = sub_train.w[-influence_idx,
## c(relevant_var.numeric.w, relevant_var.vategorical.w, "Medalist")])
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.50743 0.06310 -39.739 < 2e-16 ***
## debut_age -0.11890 0.03546 -3.353 0.000799 ***
## weight 0.16280 0.04263 3.819 0.000134 ***
## participation_number 0.37829 0.03159 11.976 < 2e-16 ***
## participant_typeGameTeam 0.82345 0.09443 8.720 < 2e-16 ***
## sexMale -0.42959 0.09019 -4.763 1.9e-06 ***
## HomeGame1 0.33051 0.13040 2.535 0.011259 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7227.0 on 13975 degrees of freedom
## Residual deviance: 6957.5 on 13969 degrees of freedom
## AIC: 6971.5
##
## Number of Fisher Scoring iterations: 5
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7026 0
## 1 504 0
##
## Accuracy : 0.9331
## 95% CI : (0.9272, 0.9386)
## No Information Rate : 1
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9331
## Specificity : NA
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : 1.0000
## Detection Rate : 0.9331
## Detection Prevalence : 0.9331
## Balanced Accuracy : NA
##
## 'Positive' Class : 0
##
As the results show, a threshold of 0.5 does not work at all. Let’s evaluate different threshold values and see which one is best.
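A threshold sweep of this kind can be sketched in base R as follows. The data are simulated and `evaluate_threshold` is a hypothetical helper, not the function used in this study. Note that in the printed matrices the row totals stay at 7026 and 504 for every threshold, which suggests that the rows hold the observed classes rather than the predictions; the sketch builds the table explicitly with predictions in rows and observed classes in columns, so that sensitivity and specificity refer to the observed classes.

```r
set.seed(1)
n <- 5000

# Simulated rare outcome (NOT the Olympic data)
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2.8 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)

# Hypothetical helper: metrics for one probability threshold
evaluate_threshold <- function(threshold, prob, actual) {
  predicted <- factor(as.integer(prob >= threshold), levels = c(0, 1))
  actual    <- factor(actual, levels = c(0, 1))
  # Rows = predicted class, columns = observed class
  cm <- table(Prediction = predicted, Reference = actual)
  sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # observed 1s predicted as 1
  specificity <- cm["0", "0"] / sum(cm[, "0"])  # observed 0s predicted as 0
  c(threshold = threshold,
    sensitivity = sensitivity,
    specificity = specificity,
    balanced_accuracy = (sensitivity + specificity) / 2)
}

thresholds <- c(0.001, 0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5)
results <- as.data.frame(t(sapply(thresholds, evaluate_threshold,
                                  prob = p_hat, actual = y)))
print(round(results, 3))
```

Here the positive class is 1 (medalist), so sensitivity answers "how many actual medalists are found". With a rare positive class, plain accuracy is dominated by the majority class, which is why balanced accuracy is the more informative column when choosing a threshold.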
## [1] 0.001
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 0 7026
## 1 0 504
##
## Accuracy : 0.0669
## 95% CI : (0.0614, 0.0728)
## No Information Rate : 1
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : NA
## Specificity : 0.06693
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : 0.00000
## Detection Rate : 0.00000
## Detection Prevalence : 0.93307
## Balanced Accuracy : NA
##
## 'Positive' Class : 0
##
## [1] 0.05
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2689 4337
## 1 103 401
##
## Accuracy : 0.4104
## 95% CI : (0.3992, 0.4216)
## No Information Rate : 0.6292
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0364
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.96311
## Specificity : 0.08463
## Pos Pred Value : 0.38272
## Neg Pred Value : 0.79563
## Prevalence : 0.37078
## Detection Rate : 0.35710
## Detection Prevalence : 0.93307
## Balanced Accuracy : 0.52387
##
## 'Positive' Class : 0
##
## [1] 0.1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5820 1206
## 1 327 177
##
## Accuracy : 0.7964
## 95% CI : (0.7871, 0.8055)
## No Information Rate : 0.8163
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0992
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9468
## Specificity : 0.1280
## Pos Pred Value : 0.8284
## Neg Pred Value : 0.3512
## Prevalence : 0.8163
## Detection Rate : 0.7729
## Detection Prevalence : 0.9331
## Balanced Accuracy : 0.5374
##
## 'Positive' Class : 0
##
## [1] 0.15
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6746 280
## 1 445 59
##
## Accuracy : 0.9037
## 95% CI : (0.8968, 0.9103)
## No Information Rate : 0.955
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.091
##
## Mcnemar's Test P-Value : 1.123e-09
##
## Sensitivity : 0.9381
## Specificity : 0.1740
## Pos Pred Value : 0.9601
## Neg Pred Value : 0.1171
## Prevalence : 0.9550
## Detection Rate : 0.8959
## Detection Prevalence : 0.9331
## Balanced Accuracy : 0.5561
##
## 'Positive' Class : 0
##
## [1] 0.2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6949 77
## 1 492 12
##
## Accuracy : 0.9244
## 95% CI : (0.9182, 0.9303)
## No Information Rate : 0.9882
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0208
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.93388
## Specificity : 0.13483
## Pos Pred Value : 0.98904
## Neg Pred Value : 0.02381
## Prevalence : 0.98818
## Detection Rate : 0.92284
## Detection Prevalence : 0.93307
## Balanced Accuracy : 0.53436
##
## 'Positive' Class : 0
##
## [1] 0.3
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7024 2
## 1 502 2
##
## Accuracy : 0.9331
## 95% CI : (0.9272, 0.9386)
## No Information Rate : 0.9995
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0068
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.933298
## Specificity : 0.500000
## Pos Pred Value : 0.999715
## Neg Pred Value : 0.003968
## Prevalence : 0.999469
## Detection Rate : 0.932802
## Detection Prevalence : 0.933068
## Balanced Accuracy : 0.716649
##
## 'Positive' Class : 0
##
## [1] 0.4
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7026 0
## 1 504 0
##
## Accuracy : 0.9331
## 95% CI : (0.9272, 0.9386)
## No Information Rate : 1
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9331
## Specificity : NA
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : 1.0000
## Detection Rate : 0.9331
## Detection Prevalence : 0.9331
## Balanced Accuracy : NA
##
## 'Positive' Class : 0
##
## [1] 0.5
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7026 0
## 1 504 0
##
## Accuracy : 0.9331
## 95% CI : (0.9272, 0.9386)
## No Information Rate : 1
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9331
## Specificity : NA
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : 1.0000
## Detection Rate : 0.9331
## Detection Prevalence : 0.9331
## Balanced Accuracy : NA
##
## 'Positive' Class : 0
##
The best threshold seems to be 0.3, the one with the highest reported balanced accuracy (0.717).

### Tree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 13982 7253.00 0 ( 0.927764 0.072236 )
## 2) participation_number < -0.100224 7737 2966.00 0 ( 0.952307 0.047693 )
## 4) participant_type: Athlete 7016 2503.00 0 ( 0.956670 0.043330 ) *
## 5) participant_type: GameTeam 721 436.80 0 ( 0.909847 0.090153 ) *
## 3) participation_number > -0.100224 6245 4132.00 0 ( 0.897358 0.102642 )
## 6) participant_type: Athlete 5733 3583.00 0 ( 0.905634 0.094366 )
## 12) debut_age < -0.151209 3092 2184.00 0 ( 0.886805 0.113195 )
## 24) height < 0.109535 1748 1373.00 0 ( 0.866705 0.133295 ) *
## 25) height > 0.109535 1344 794.80 0 ( 0.912946 0.087054 ) *
## 13) debut_age > -0.151209 2641 1371.00 0 ( 0.927679 0.072321 )
## 26) weight < 1.3013 2328 1097.00 0 ( 0.936856 0.063144 )
## 52) debut_age < 0.937697 1829 927.70 0 ( 0.930016 0.069984 ) *
## 53) debut_age > 0.937697 499 161.50 0 ( 0.961924 0.038076 )
## 106) weight < 0.0839942 293 129.80 0 ( 0.941980 0.058020 ) *
## 107) weight > 0.0839942 206 22.52 0 ( 0.990291 0.009709 ) *
## 27) weight > 1.3013 313 254.20 0 ( 0.859425 0.140575 ) *
## 7) participant_type: GameTeam 512 505.70 0 ( 0.804688 0.195312 ) *
Given the strong class imbalance (winning a medal is very rare), the
decision tree does not perform well at all: it always predicts class 0.
So I proceed to evaluate and confirm the performance of the logistic
regression.
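The reason the tree always predicts 0 can be read directly from its terminal nodes: a leaf predicts the majority class, and no leaf has a medal probability above 0.5. The sketch below copies the leaf sizes and class-1 probabilities from the tree printed above; the 0.1 cutoff at the end is only an illustrative assumption.

```r
# Terminal nodes copied from the printed tree (node number, size, P(Medalist = 1))
leaf_node    <- c(4, 5, 24, 25, 52, 106, 107, 27, 7)
leaf_n       <- c(7016, 721, 1748, 1344, 1829, 293, 206, 313, 512)
leaf_p_medal <- c(0.043330, 0.090153, 0.133295, 0.087054, 0.069984,
                  0.058020, 0.009709, 0.140575, 0.195312)

# Majority vote in each leaf: predict 1 only if P(medal) exceeds 0.5
predicted_class <- as.integer(leaf_p_medal > 0.5)
print(table(predicted_class))

# Weighted by leaf size, the leaf probabilities recover the overall medal rate
overall_rate <- weighted.mean(leaf_p_medal, leaf_n)
print(round(overall_rate, 6))

# Illustrative assumption: with a lower cutoff some leaves would predict 1
print(leaf_node[leaf_p_medal > 0.1])
```

Even the most favourable leaf (node 7, returning GameTeam members) reaches a medal probability of only about 0.195, so the default majority rule never produces a positive prediction.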
##
## Call:
## glm(formula = Medalist ~ debut_age + weight + participation_number +
## participant_type + sex + HomeGame, family = binomial, data = train.w[,
## c(relevant_var.numeric.w, relevant_var.vategorical.w, "Medalist")])
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -17.48071 3.87564 -4.510 6.47e-06 ***
## debut_age -2.84465 0.68815 -4.134 3.57e-05 ***
## weight 0.12752 0.03505 3.638 0.000275 ***
## participation_number 0.73669 0.04988 14.770 < 2e-16 ***
## participant_typeGameTeam 0.83708 0.07644 10.950 < 2e-16 ***
## sexMale -0.39378 0.07346 -5.361 8.30e-08 ***
## HomeGame1 0.40354 0.10279 3.926 8.64e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10955 on 21511 degrees of freedom
## Residual deviance: 10551 on 21505 degrees of freedom
## AIC: 10565
##
## Number of Fisher Scoring iterations: 5
##
## Call:
## glm(formula = Medalist ~ debut_age + weight + participation_number +
## participant_type + sex + HomeGame, family = binomial, data = train.w[-influence_idx,
## c(relevant_var.numeric.w, relevant_var.vategorical.w, "Medalist")])
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.36794 3.88542 -4.727 2.27e-06 ***
## debut_age -3.00408 0.68987 -4.355 1.33e-05 ***
## weight 0.13338 0.03508 3.802 0.000143 ***
## participation_number 0.73922 0.04997 14.793 < 2e-16 ***
## participant_typeGameTeam 0.84112 0.07660 10.981 < 2e-16 ***
## sexMale -0.40586 0.07358 -5.516 3.47e-08 ***
## HomeGame1 0.38447 0.10385 3.702 0.000214 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10928 on 21505 degrees of freedom
## Residual deviance: 10521 on 21499 degrees of freedom
## AIC: 10535
##
## Number of Fisher Scoring iterations: 5
## discipline_title event_title participant_type
## Alpine Skiing :1121 Length:5378 Athlete :4892
## Cross Country Skiing:1110 Class :character GameTeam: 486
## Biathlon : 741 Mode :character
## Speed skating : 521
## Snowboard : 342
## Freestyle Skiing : 315
## (Other) :1228
## game_location game_season HomeGame Medalist athlete_year_birth
## Length:5378 Summer: 0 0:5063 0:4972 Min. :1950
## Class :character Winter:5378 1: 315 1: 406 1st Qu.:1973
## Mode :character Median :1982
## Mean :1982
## 3rd Qu.:1990
## Max. :2008
##
## debut_age sex height weight NOC_code
## Min. :14.00 Female:2233 Min. :136.0 Min. : 30.00 USA : 394
## 1st Qu.:21.00 Male :2974 1st Qu.:168.0 1st Qu.: 60.00 CAN : 328
## Median :23.00 NA's : 171 Median :174.0 Median : 68.00 ITA : 286
## Mean :23.34 Mean :174.2 Mean : 69.44 GER : 256
## 3rd Qu.:25.00 3rd Qu.:180.0 3rd Qu.: 78.00 FRA : 241
## Max. :47.00 Max. :204.0 Max. :125.00 (Other):3702
## NA's :790 NA's :790 NA's : 171
## result_age participation_number
## Min. :14.00 Min. :1.00
## 1st Qu.:23.00 1st Qu.:1.00
## Median :26.00 Median :1.00
## Mean :26.21 Mean :1.66
## 3rd Qu.:29.00 3rd Qu.:2.00
## Max. :51.00 Max. :8.00
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4965 7
## 1 405 1
##
## Accuracy : 0.9234
## 95% CI : (0.916, 0.9304)
## No Information Rate : 0.9985
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0019
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.924581
## Specificity : 0.125000
## Pos Pred Value : 0.998592
## Neg Pred Value : 0.002463
## Prevalence : 0.998512
## Detection Rate : 0.923206
## Detection Prevalence : 0.924507
## Balanced Accuracy : 0.524791
##
## 'Positive' Class : 0
##
## [[1]]
## [1] 0.6524198
The model doesn’t seem to perform very well, but it can give us an idea of what it takes to win a medal. In particular, it highlights that winning an Olympic medal is not a matter of chance but the result of several measurable factors. The most notable are:

- Athletes who debut at a younger age have a higher probability of reaching the podium, which is consistent with the importance of early specialization and long-term career development;
- The number of participations is one of the strongest predictors: experience accumulated across multiple Games significantly increases the chances of success;
- Being part of a GameTeam rather than competing individually also improves the odds, as collective events tend to guarantee more stable performances;
- Competing at home (HomeGame) offers a tangible advantage, confirming the existence of a “home effect” in the Olympic Games.
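The size of these effects is easier to read as odds ratios, obtained by exponentiating the logistic regression coefficients. The sketch below uses the estimates copied from the last model summary above (the fit on train.w without the influential points); the vector name `coefs` is chosen here for illustration.

```r
# Estimates copied from the final model summary printed above
coefs <- c(debut_age                = -3.00408,
           weight                   =  0.13338,
           participation_number     =  0.73922,
           participant_typeGameTeam =  0.84112,
           sexMale                  = -0.40586,
           HomeGame1                =  0.38447)

# Odds ratio = exp(coefficient): multiplicative change in the odds of a medal
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

An odds ratio above 1 raises the odds of a medal and one below 1 lowers them. For the categorical terms the reading is direct: GameTeam versus Athlete, Male versus Female, home versus away. For the numeric terms the ratio refers to one unit on the transformed scale used in the model, so it should not be read as "per year" or "per kilogram".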